{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial 03: Running RLlib Experiments\n",
    "\n",
    "This tutorial walks you through the process of running traffic simulations in Flow with trainable RLlib-powered agents. Autonomous agents will learn to maximize a certain reward over the rollouts, using the [**RLlib**](https://ray.readthedocs.io/en/latest/rllib.html) library ([citation](https://arxiv.org/abs/1712.09381)) ([installation instructions](https://flow.readthedocs.io/en/latest/flow_setup.html#optional-install-ray-rllib)). Simulations of this form will depict the propensity of RL agents to influence the traffic of a human fleet in order to make the whole fleet more efficient (for some given metrics). \n",
    "\n",
    "In this exercise, we simulate an initially perturbed single lane ring road, where we introduce a single autonomous vehicle. We witness that, after some training, that the autonomous vehicle learns to dissipate the formation and propagation of \"phantom jams\" which form when only human driver dynamics are involved.\n",
    "\n",
    "## 1. Components of a Simulation\n",
    "All simulations, both in the presence and absence of RL, require two components: a *network*, and an *environment*. Networks describe the features of the transportation network used in simulation. This includes the positions and properties of nodes and edges constituting the lanes and junctions, as well as properties of the vehicles, traffic lights, inflows, etc... in the network. Environments, on the other hand, initialize, reset, and advance simulations, and act as the primary interface between the reinforcement learning algorithm and the network. Moreover, custom environments may be used to modify the dynamical features of an network. Finally, in the RL case, it is in the *environment* that the state/action spaces and the reward function are defined. \n",
    "\n",
    "## 2. Setting up a Network\n",
    "Flow contains a plethora of pre-designed networks used to replicate highways, intersections, and merges in both closed and open settings. All these networks are located in flow/networks. For this exercise, which involves a single lane ring road, we will use the network `RingNetwork`.\n",
    "\n",
    "### 2.1 Setting up Network Parameters\n",
    "\n",
    "The network mentioned at the start of this section, as well as all other networks in Flow, are parameterized by the following arguments: \n",
    "* name\n",
    "* vehicles\n",
    "* net_params\n",
    "* initial_config\n",
    "\n",
    "These parameters are explained in detail in exercise 1. Moreover, all parameters excluding vehicles (covered in section 2.2) do not change from the previous exercise. Accordingly, we specify them nearly as we have before, and leave further explanations of the parameters to exercise 1.\n",
    "\n",
    "We begin by choosing the network the experiment will be trained on. We use one of Flow's builtin networks, located in `flow.networks`. A list of all available netowrks can be found by running the script below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import flow.networks as networks\n",
    "\n",
    "print(networks.__all__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial, we choose to use the ring road network. The network class is then:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from flow.networks import RingNetwork\n",
    "\n",
    "# ring road network class\n",
    "network_name = RingNetwork"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One key difference between SUMO and RLlib experiments is that, in RLlib experiments, the network classes do not need to be defined; instead users should simply name the network class they wish to use. Later on, an environment setup module will import the correct network class based on the provided names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# input parameter classes to the network class\n",
    "from flow.core.params import NetParams, InitialConfig\n",
    "\n",
    "# name of the network\n",
    "name = \"training_example\"\n",
    "\n",
    "# network-specific parameters\n",
    "from flow.networks.ring import ADDITIONAL_NET_PARAMS\n",
    "net_params = NetParams(additional_params=ADDITIONAL_NET_PARAMS)\n",
    "\n",
    "# initial configuration to vehicles\n",
    "initial_config = InitialConfig(spacing=\"uniform\", perturbation=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Adding Trainable Autonomous Vehicles\n",
    "The `Vehicles` class stores state information on all vehicles in the network. This class is used to identify the dynamical features of a vehicle and whether it is controlled by a reinforcement learning agent. Morover, information pertaining to the observations and reward function can be collected from various `get` methods within this class.\n",
    "\n",
    "The dynamics of vehicles in the `Vehicles` class can either be depicted by sumo or by the dynamical methods located in flow/controllers. For human-driven vehicles, we use the IDM model for acceleration behavior, with exogenous gaussian acceleration noise with std 0.2 m/s2 to induce perturbations that produce stop-and-go behavior. In addition, we use the `ContinousRouter` routing controller so that the vehicles may maintain their routes closed networks.\n",
    "\n",
    "As we have done in exercise 1, human-driven vehicles are defined in the `VehicleParams` class as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# vehicles class\n",
    "from flow.core.params import VehicleParams\n",
    "\n",
    "# vehicles dynamics models\n",
    "from flow.controllers import IDMController, ContinuousRouter\n",
    "\n",
    "vehicles = VehicleParams()\n",
    "vehicles.add(\"human\",\n",
    "             acceleration_controller=(IDMController, {}),\n",
    "             routing_controller=(ContinuousRouter, {}),\n",
    "             num_vehicles=21)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above addition to the `Vehicles` class only accounts for 21 of the 22 vehicles that are placed in the network. We now add an additional trainable autuonomous vehicle whose actions are dictated by an RL agent. This is done by specifying an `RLController` as the acceleraton controller to the vehicle. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from flow.controllers import RLController"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that this controller serves primarirly as a placeholder that marks the vehicle as a component of the RL agent, meaning that lane changing and routing actions can also be specified by the RL agent to this vehicle.\n",
    "\n",
    "We finally add the vehicle as follows, while again using the `ContinuousRouter` to perpetually maintain the vehicle within the network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vehicles.add(veh_id=\"rl\",\n",
    "             acceleration_controller=(RLController, {}),\n",
    "             routing_controller=(ContinuousRouter, {}),\n",
    "             num_vehicles=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Setting up an Environment\n",
    "\n",
    "Several environments in Flow exist to train RL agents of different forms (e.g. autonomous vehicles, traffic lights) to perform a variety of different tasks. The use of an environment allows us to view the cumulative reward simulation rollouts receive, along with to specify the state/action spaces.\n",
    "\n",
    "Envrionments in Flow are parametrized by three components:\n",
    "* env_params\n",
    "* sumo_params\n",
    "* network\n",
    "\n",
    "### 3.1 SumoParams\n",
    "`SumoParams` specifies simulation-specific variables. These variables include the length of any simulation step and whether to render the GUI when running the experiment. For this example, we consider a simulation step length of 0.1s and activate the GUI. \n",
    "\n",
    "**Note** For training purposes, it is highly recommanded to deactivate the GUI in order to avoid global slow down. In such case, one just needs to specify the following: `render=False`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from flow.core.params import SumoParams\n",
    "\n",
    "sumo_params = SumoParams(sim_step=0.1, render=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 EnvParams\n",
    "\n",
    "`EnvParams` specifies environment and experiment-specific parameters that either affect the training process or the dynamics of various components within the network. For the environment `WaveAttenuationPOEnv`, these parameters are used to dictate bounds on the accelerations of the autonomous vehicles, as well as the range of ring lengths (and accordingly network densities) the agent is trained on.\n",
    "\n",
    "Finally, it is important to specify here the *horizon* of the experiment, which is the duration of one episode (during which the RL-agent acquire data). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from flow.core.params import EnvParams\n",
    "\n",
    "# Define horizon as a variable to ensure consistent use across notebook\n",
    "HORIZON=100\n",
    "\n",
    "env_params = EnvParams(\n",
    "    # length of one rollout\n",
    "    horizon=HORIZON,\n",
    "\n",
    "    additional_params={\n",
    "        # maximum acceleration of autonomous vehicles\n",
    "        \"max_accel\": 1,\n",
    "        # maximum deceleration of autonomous vehicles\n",
    "        \"max_decel\": 1,\n",
    "        # bounds on the ranges of ring road lengths the autonomous vehicle \n",
    "        # is trained on\n",
    "        \"ring_length\": [220, 270],\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 Initializing a Gym Environment\n",
    "\n",
    "Now, we have to specify our Gym Environment and the algorithm that our RL agents will use. Similar to the network, we choose to use on of Flow's builtin environments, a list of which is provided by the script below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import flow.envs as flowenvs\n",
    "\n",
    "print(flowenvs.__all__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use the environment \"WaveAttenuationPOEnv\", which is used to train autonomous vehicles to attenuate the formation and propagation of waves in a partially observable variable density ring road. To create the Gym Environment, the only necessary parameters are the environment name plus the previously defined variables. These are defined as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from flow.envs import WaveAttenuationPOEnv\n",
    "\n",
    "env_name = WaveAttenuationPOEnv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.4 Setting up Flow Parameters\n",
    "\n",
    "RLlib experiments both generate a `params.json` file for each experiment run. For RLlib experiments, the parameters defining the Flow network and environment must be stored as well. As such, in this section we define the dictionary `flow_params`, which contains the variables required by the utility function `make_create_env`. `make_create_env` is a higher-order function which returns a function `create_env` that initializes a Gym environment corresponding to the Flow network specified."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creating flow_params. Make sure the dictionary keys are as specified. \n",
    "flow_params = dict(\n",
    "    # name of the experiment\n",
    "    exp_tag=name,\n",
    "    # name of the flow environment the experiment is running on\n",
    "    env_name=env_name,\n",
    "    # name of the network class the experiment uses\n",
    "    network=network_name,\n",
    "    # simulator that is used by the experiment\n",
    "    simulator='traci',\n",
    "    # sumo-related parameters (see flow.core.params.SumoParams)\n",
    "    sim=sumo_params,\n",
    "    # environment related parameters (see flow.core.params.EnvParams)\n",
    "    env=env_params,\n",
    "    # network-related parameters (see flow.core.params.NetParams and\n",
    "    # the network's documentation or ADDITIONAL_NET_PARAMS component)\n",
    "    net=net_params,\n",
    "    # vehicles to be placed in the network at the start of a rollout \n",
    "    # (see flow.core.vehicles.Vehicles)\n",
    "    veh=vehicles,\n",
    "    # (optional) parameters affecting the positioning of vehicles upon \n",
    "    # initialization/reset (see flow.core.params.InitialConfig)\n",
    "    initial=initial_config\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4 Running RL experiments in Ray\n",
    "\n",
    "### 4.1 Import \n",
    "\n",
    "First, we must import modules required to run experiments in Ray. The `json` package is required to store the Flow experiment parameters in the `params.json` file, as is `FlowParamsEncoder`. Ray-related imports are required: the PPO algorithm agent, `ray.tune`'s experiment runner, and environment helper methods `register_env` and `make_create_env`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "import ray\n",
    "try:\n",
    "    from ray.rllib.agents.agent import get_agent_class\n",
    "except ImportError:\n",
    "    from ray.rllib.agents.registry import get_agent_class\n",
    "from ray.tune import run_experiments\n",
    "from ray.tune.registry import register_env\n",
    "\n",
    "from flow.utils.registry import make_create_env\n",
    "from flow.utils.rllib import FlowParamsEncoder"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 Initializing Ray\n",
    "Here, we initialize Ray and experiment-based constant variables specifying parallelism in the experiment as well as experiment batch size in terms of number of rollouts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# number of parallel workers\n",
    "N_CPUS = 2\n",
    "# number of rollouts per training iteration\n",
    "N_ROLLOUTS = 1\n",
    "\n",
    "ray.init(num_cpus=N_CPUS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3 Configuration and Setup\n",
    "Here, we copy and modify the default configuration for the [PPO algorithm](https://arxiv.org/abs/1707.06347). The agent has the number of parallel workers specified, a batch size corresponding to `N_ROLLOUTS` rollouts (each of which has length `HORIZON` steps), a discount rate $\\gamma$ of 0.999, two hidden layers of size 16, uses Generalized Advantage Estimation, $\\lambda$ of 0.97, and other parameters as set below.\n",
    "\n",
    "Once `config` contains the desired parameters, a JSON string corresponding to the `flow_params` specified in section 3 is generated. The `FlowParamsEncoder` maps objects to string representations so that the experiment can be reproduced later. That string representation is stored within the `env_config` section of the `config` dictionary. Later, `config` is written out to the file `params.json`. \n",
    "\n",
    "Next, we call `make_create_env` and pass in the `flow_params` to return a function we can use to register our Flow environment with Gym. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The algorithm or model to train. This may refer to \"\n",
    "#      \"the name of a built-on algorithm (e.g. RLLib's DQN \"\n",
    "#      \"or PPO), or a user-defined trainable function or \"\n",
    "#      \"class registered in the tune registry.\")\n",
    "alg_run = \"PPO\"\n",
    "\n",
    "agent_cls = get_agent_class(alg_run)\n",
    "config = agent_cls._default_config.copy()\n",
    "config[\"num_workers\"] = N_CPUS - 1  # number of parallel workers\n",
    "config[\"train_batch_size\"] = HORIZON * N_ROLLOUTS  # batch size\n",
    "config[\"gamma\"] = 0.999  # discount rate\n",
    "config[\"model\"].update({\"fcnet_hiddens\": [16, 16]})  # size of hidden layers in network\n",
    "config[\"use_gae\"] = True  # using generalized advantage estimation\n",
    "config[\"lambda\"] = 0.97  \n",
    "config[\"sgd_minibatch_size\"] = min(16 * 1024, config[\"train_batch_size\"])  # stochastic gradient descent\n",
    "config[\"kl_target\"] = 0.02  # target KL divergence\n",
    "config[\"num_sgd_iter\"] = 10  # number of SGD iterations\n",
    "config[\"horizon\"] = HORIZON  # rollout horizon\n",
    "\n",
    "# save the flow params for replay\n",
    "flow_json = json.dumps(flow_params, cls=FlowParamsEncoder, sort_keys=True,\n",
    "                       indent=4)  # generating a string version of flow_params\n",
    "config['env_config']['flow_params'] = flow_json  # adding the flow_params to config dict\n",
    "config['env_config']['run'] = alg_run\n",
    "\n",
    "# Call the utility function make_create_env to be able to \n",
    "# register the Flow env for this experiment\n",
    "create_env, gym_name = make_create_env(params=flow_params, version=0)\n",
    "\n",
    "# Register as rllib env with Gym\n",
    "register_env(gym_name, create_env)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.4 Running Experiments\n",
    "\n",
    "Here, we use the `run_experiments` function from `ray.tune`. The function takes a dictionary with one key, a name corresponding to the experiment, and one value, itself a dictionary containing parameters for training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "trials = run_experiments({\n",
    "    flow_params[\"exp_tag\"]: {\n",
    "        \"run\": alg_run,\n",
    "        \"env\": gym_name,\n",
    "        \"config\": {\n",
    "            **config\n",
    "        },\n",
    "        \"checkpoint_freq\": 1,  # number of iterations between checkpoints\n",
    "        \"checkpoint_at_end\": True,  # generate a checkpoint at the end\n",
    "        \"max_failures\": 999,\n",
    "        \"stop\": {  # stopping conditions\n",
    "            \"training_iteration\": 1,  # number of iterations to stop after\n",
    "        },\n",
    "    },\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.5 Visualizing the results\n",
    "\n",
    "The simulation results are saved within the `ray_results/training_example` directory (we defined `training_example` at the start of this tutorial). The `ray_results` folder is by default located at your root `~/ray_results`. \n",
    "\n",
    "You can run `tensorboard --logdir=~/ray_results/training_example` (install it with `pip install tensorboard`) to visualize the different data outputted by your simulation.\n",
    "\n",
    "For more instructions about visualizing, please see `tutorial05_visualize.ipynb`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.6 Restart from a checkpoint / Transfer learning\n",
    "\n",
    "If you wish to do transfer learning, or to resume a previous training, you will need to start the simulation from a previous checkpoint. To do that, you can add a `restore` parameter in the `run_experiments` argument, as follows:\n",
    "\n",
    "```python\n",
    "trials = run_experiments({\n",
    "    flow_params[\"exp_tag\"]: {\n",
    "        \"run\": alg_run,\n",
    "        \"env\": gym_name,\n",
    "        \"config\": {\n",
    "            **config\n",
    "        },\n",
    "        \"restore\": \"/ray_results/experiment/dir/checkpoint_50/checkpoint-50\"\n",
    "        \"checkpoint_freq\": 1,\n",
    "        \"checkpoint_at_end\": True,\n",
    "        \"max_failures\": 999,\n",
    "        \"stop\": {\n",
    "            \"training_iteration\": 1,\n",
    "        },\n",
    "    },\n",
    "})\n",
    "```\n",
    "\n",
    "The `\"restore\"` path should be such that the `[restore]/.tune_metadata` file exists.\n",
    "\n",
    "There is also a `\"resume\"` parameter that you can set to `True` if you just wish to continue the training from a previously saved checkpoint, in case you are still training on the same experiment. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
